Skip to content

3610: AVRO-4222: [C++] Support writer to specify compression level#23

Open
martin-augment wants to merge 1 commit intomainfrom
pr-3610-2025-12-31-06-40-42
Open

3610: AVRO-4222: [C++] Support writer to specify compression level#23
martin-augment wants to merge 1 commit intomainfrom
pr-3610-2025-12-31-06-40-42

Conversation

@martin-augment
Copy link
Copy Markdown
Owner

3610: To review by AI

@github-actions github-actions bot added the C++ label Dec 31, 2025
@coderabbitai
Copy link
Copy Markdown

coderabbitai bot commented Dec 31, 2025

Walkthrough

This pull request adds support for optional compression levels in the Avro C++ DataFile implementation. The changes extend the DataFileWriter API to accept an optional compression level parameter, update DEFLATE and ZSTD compression routines to utilize this parameter, and introduce codec validation and availability utilities. The Codec enum is modified to unconditionally define SNAPPY_CODEC and ZSTD_CODEC. ZstdCompressWrapper's compress method signature is updated to accept an optional compression level. Corresponding public API declarations and constructor signatures are updated across DataFile.hh, and comprehensive test functions are added to validate codec behavior and compression round-trips.

✨ Finishing touches
  • 📝 Generate docstrings

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@claude
Copy link
Copy Markdown

claude bot commented Dec 31, 2025

PR Review: AVRO-4222 - Support writer to specify compression level

Summary

This PR adds support for specifying compression levels when writing Avro data files in C++. The implementation looks solid overall with good test coverage and proper validation.


✅ Strengths

  1. Comprehensive Test Coverage - Excellent property-based testing approach:

    • Random compression level validation (100 iterations)
    • Random round-trip testing with various data sizes
    • Tests for codec availability and enum values
    • Both DEFLATE and ZSTD codecs covered
  2. Good API Design:

    • Uses std::optional<int> for compression level (modern C++ best practice)
    • Defaults to codec-specific defaults when not specified
    • Clear separation of concerns with validateCodec() and isCodecAvailable()
  3. Proper Validation:

    • Validates compression levels at construction time
    • Clear error messages indicating valid ranges
    • Codec availability checks before use
  4. Backward Compatibility:

    • Optional parameter defaults to std::nullopt
    • Existing code continues to work without changes

🔍 Issues & Suggestions

Critical Issues

  1. Duplicate codec metadata setting (lang/c++/impl/DataFile.cc:175-177)

    validateCodec(codec_, compressionLevel_);
    setMetadata(AVRO_CODEC_KEY, AVRO_NULL_CODEC);  // Line 176 - Always sets null
    if (codec_ == NULL_CODEC) {
        setMetadata(AVRO_CODEC_KEY, AVRO_NULL_CODEC);  // Line 178 - Duplicate!

    The metadata is unconditionally set to AVRO_NULL_CODEC at line 176, then immediately overwritten. Remove line 176.

  2. Missing header include (lang/c++/impl/DataFile.cc)
    The code uses std::optional but doesn't explicitly include <optional>. While it may work through transitive includes, it should be explicitly included for portability:

    #include <optional>

Code Quality

  1. Inconsistent codec validation logic (lang/c++/impl/DataFile.cc:67-109)

    • The validateCodec() function silently ignores compression levels for NULL_CODEC and SNAPPY_CODEC
    • Consider logging or warning users that the level is being ignored
    • Alternative: Throw an exception if level is specified for codecs that don't support it (stricter validation)
  2. Codec name formatting inconsistency (lang/c++/impl/DataFile.cc:67-73)
    The codecToString() helper returns "zstandard" but metadata uses AVRO_ZSTD_CODEC (presumably "zstd"). Ensure consistency with Avro spec naming conventions.

  3. Missing const correctness (lang/c++/impl/DataFile.cc:67-73)
    The codecToString() function could be constexpr for compile-time evaluation:

    constexpr const char* codecToString(Codec codec) {
        // ... implementation
    }

Test Improvements

  1. Non-deterministic test seeding (lang/c++/test/DataFileTests.cc:1289, 1330)

    boost::mt19937 rng(static_cast<uint32_t>(time(nullptr)));

    Using time(nullptr) makes tests non-reproducible. Consider using a fixed seed or allowing seed to be specified via environment variable for debugging:

    const char* seed_env = std::getenv("AVRO_TEST_SEED");
    uint32_t seed = seed_env ? std::atoi(seed_env) : static_cast<uint32_t>(time(nullptr));
    boost::mt19937 rng(seed);
    std::cout << "Using seed: " << seed << std::endl;
  2. Edge case testing gaps:

    • No test for compressionLevel = 0 for DEFLATE (valid but edge case)
    • No test for large file sizes to verify compression actually works under memory pressure
    • No test verifying that different compression levels produce different compressed sizes

Documentation

  1. Missing documentation for new parameter:
    • Add Doxygen comments for compressionLevel parameter in DataFileWriterBase constructor
    • Document valid ranges in header file
    • Example usage would be helpful

🔒 Security Considerations

No security concerns identified. The validation properly prevents:

  • Integer overflow (bounded ranges)
  • Invalid codec usage (availability checks)
  • Buffer issues are handled by underlying zlib/zstd libraries

⚡ Performance Considerations

  1. Compression level validation overhead: Validation happens at construction time, so no runtime overhead during writes - good design.

  2. Memory allocation in ZstdCompressWrapper: The code uses ZSTD_compressBound() to pre-allocate buffers, which is correct. No concerns here.

  3. Default compression levels: Using defaults (Z_DEFAULT_COMPRESSION, ZSTD_CLEVEL_DEFAULT) when unspecified is appropriate for backward compatibility.


📋 Recommendations

Must fix before merge:

  1. Remove duplicate setMetadata(AVRO_CODEC_KEY, AVRO_NULL_CODEC) at line 176

Should fix:
2. Add explicit #include <optional> in DataFile.cc
3. Improve test reproducibility with configurable seeds
4. Add documentation for the new parameter

Nice to have:
5. Consider stricter validation for unsupported codec/level combinations
6. Add edge case tests (level 0, compression ratio verification)
7. Make codecToString() constexpr


📊 Overall Assessment

Quality: 8/10 - Well-implemented feature with excellent test coverage, but needs the critical duplicate metadata bug fixed.

The PR demonstrates good understanding of:

  • Modern C++ practices (std::optional, move semantics)
  • Property-based testing methodology
  • API backward compatibility
  • Error handling and validation

Great work overall! Please address the critical metadata issue and consider the other suggestions.

Copy link
Copy Markdown

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Nitpick comments (1)
lang/c++/impl/DataFile.cc (1)

176-188: Redundant metadata assignment.

Line 177 unconditionally sets AVRO_CODEC_KEY to AVRO_NULL_CODEC, but lines 178-188 immediately overwrite it based on the actual codec. The initial assignment on line 177 is unnecessary.

🔎 Suggested fix
     validateCodec(codec_, compressionLevel_);
-    setMetadata(AVRO_CODEC_KEY, AVRO_NULL_CODEC);
     if (codec_ == NULL_CODEC) {
         setMetadata(AVRO_CODEC_KEY, AVRO_NULL_CODEC);
📜 Review details

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between a45b25d and f4bb502.

📒 Files selected for processing (5)
  • lang/c++/impl/DataFile.cc
  • lang/c++/impl/ZstdCompressWrapper.cc
  • lang/c++/impl/ZstdCompressWrapper.hh
  • lang/c++/include/avro/DataFile.hh
  • lang/c++/test/DataFileTests.cc
🧰 Additional context used
🧬 Code graph analysis (3)
lang/c++/impl/ZstdCompressWrapper.cc (1)
lang/c++/impl/ZstdCompressWrapper.hh (1)
  • uncompressed (36-36)
lang/c++/test/DataFileTests.cc (2)
lang/c++/include/avro/DataFile.hh (1)
  • isCodecAvailable (48-48)
lang/c++/impl/DataFile.cc (2)
  • isCodecAvailable (113-133)
  • isCodecAvailable (113-113)
lang/c++/include/avro/DataFile.hh (1)
lang/c++/test/DataFileTests.cc (2)
  • codec (225-236)
  • codec (225-225)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: Cursor Bugbot
  • GitHub Check: claude-review
🔇 Additional comments (19)
lang/c++/impl/ZstdCompressWrapper.cc (1)

28-40: LGTM!

The updated compress method correctly uses std::optional<int> for the compression level parameter and applies value_or(ZSTD_CLEVEL_DEFAULT) to fall back to ZSTD's default when no level is specified. This aligns with the header declaration and the broader PR goal.

lang/c++/impl/ZstdCompressWrapper.hh (1)

24-36: LGTM!

The header correctly adds <optional> include and updates the compress method signature with a default value of std::nullopt, ensuring backward compatibility for existing callers.

lang/c++/test/DataFileTests.cc (7)

1287-1321: LGTM! Good property-based testing approach.

The validation test comprehensively covers the deflate compression level range (0-9), invalid levels, and the nullopt case. The random sampling provides good coverage.


1323-1359: LGTM!

The ZSTD validation test correctly tests the 1-22 valid range and is properly guarded by ZSTD_CODEC_AVAILABLE.


1361-1413: LGTM!

The round-trip test properly exercises various compression levels including the nullopt case and verifies data integrity. The random data generation and verification logic is sound.


1415-1470: LGTM!

The ZSTD round-trip test correctly handles the valid level range (1-22) and maps 0 to nullopt. Well-structured parallel to the deflate test.


1472-1479: LGTM!

Testing explicit enum values helps ensure ABI stability and prevents accidental changes to the codec ordering.


1481-1498: LGTM!

The test correctly validates runtime codec availability, matching the compile-time configuration via SNAPPY_CODEC_AVAILABLE and ZSTD_CODEC_AVAILABLE guards.


1703-1717: LGTM!

Test registration is complete and properly guards ZSTD-specific tests with ZSTD_CODEC_AVAILABLE.

lang/c++/include/avro/DataFile.hh (4)

38-48: LGTM! Good API design.

Making SNAPPY_CODEC and ZSTD_CODEC unconditionally visible in the enum while providing isCodecAvailable() for runtime checks is a clean design. This allows callers to reference codec values without compile-time guards while still getting proper runtime errors for unavailable codecs.


83-83: LGTM!

The compressionLevel_ member correctly uses std::optional<int> to store the optional compression level.


135-140: LGTM!

The DataFileWriterBase constructor declarations correctly add the optional compression level parameter with a sensible default, maintaining backward compatibility.


174-182: LGTM!

The DataFileWriter template constructors correctly forward the compressionLevel parameter to the base class, maintaining API consistency.

lang/c++/impl/DataFile.cc (6)

67-75: LGTM!

The codecToString helper provides clear string representations for all codec types, useful for error messages.


77-109: LGTM!

The validateCodec function correctly validates codec availability and compression level ranges. The early return for nullopt levels and the per-codec validation logic is well-structured.


113-133: LGTM!

The isCodecAvailable function correctly reports codec availability based on compile-time configuration, enabling runtime checks before codec usage.


229-232: LGTM!

The deflate path correctly uses value_or(Z_DEFAULT_COMPRESSION) to apply the specified level or fall back to zlib's default compression level.


317-318: LGTM!

The ZSTD path correctly passes the optional compressionLevel_ to the wrapper, which handles the default internally.


135-167: LGTM!

Both DataFileWriterBase constructors correctly initialize the new compressionLevel_ member and maintain consistency between the file-based and stream-based variants.

@augmentcode
Copy link
Copy Markdown

augmentcode bot commented Dec 31, 2025

🤖 Augment PR Summary

Summary: Adds optional compression-level support for the C++ Avro data-file writer, primarily for deflate and zstandard.

Changes:

  • Extends DataFileWriterBase/DataFileWriter constructors with std::optional<int> compressionLevel.
  • Adds isCodecAvailable(Codec) and validates codec availability + level ranges during writer initialization.
  • Threads the chosen level into zlib via deflateInit2 (deflate) and into ZstdCompressWrapper::compress (zstandard).
  • Updates the ZSTD compression wrapper API to accept an optional level.
  • Adds unit tests for codec enum stability, availability reporting, compression-level validation, and round-trip read/write across varied levels.

Notes: When no level is provided, the library defaults (Z_DEFAULT_COMPRESSION/ZSTD_CLEVEL_DEFAULT) are used; invalid levels throw avro::Exception.

🤖 Was this summary useful? React with 👍 or 👎

Copy link
Copy Markdown

@augmentcode augmentcode bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review completed. 2 suggestions posted.

Fix All in Augment

Comment augment review to trigger a new review at any time.

setMetadata(AVRO_CODEC_KEY, AVRO_DEFLATE_CODEC);
#ifdef SNAPPY_CODEC_AVAILABLE
} else if (codec_ == SNAPPY_CODEC) {
setMetadata(AVRO_CODEC_KEY, AVRO_SNAPPY_CODEC);
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This references AVRO_SNAPPY_CODEC (and similarly AVRO_ZSTD_CODEC below) unconditionally, but those metadata strings are still only defined under #ifdef SNAPPY_CODEC_AVAILABLE / #ifdef ZSTD_CODEC_AVAILABLE earlier in this file. As a result, builds without those optional codecs enabled will fail to compile here.

Fix This in Augment

🤖 Was this useful? React with 👍 or 👎

Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

value:useful; category:bug; feedback:The Augment AI reviewer is correct! The Pull Request author tried to simplify the code by removing the usage of the build arguments but introduced build errors in case some of the arguments are not defined.

syncInterval_(syncInterval),
codec_(codec),
compressionLevel_(compressionLevel),
stream_(fileOutputStream(filename)),
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because validateCodec() runs inside init() after stream_(fileOutputStream(filename)) has already executed, an unavailable codec or invalid compression level can still create/truncate the target file before throwing. This is a new side-effect path introduced by the compression-level validation.

Fix This in Augment

🤖 Was this useful? React with 👍 or 👎

Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

value:useful; category:bug; feedback:The Augment AI reviewer is correct! The output file is created/truncated and then the validation checks are executed. This is an old issue - it was the same until now if syncInterval was invalid. Prevents creating/truncating the output file if it won't be used due to validation failure.

@martin-augment
Copy link
Copy Markdown
Owner Author

martin-augment commented Jan 2, 2026

176-188: Redundant metadata assignment.

Line 177 unconditionally sets AVRO_CODEC_KEY to AVRO_NULL_CODEC, but lines 178-188 immediately overwrite it based on the actual codec. The initial assignment on line 177 is unnecessary.

value:good-to-have; category:bug; feedback:The CodeRabbit AI reviewer is correct! There is no need to check whether the codec is AVRO_NULL_CODEC because it is set as a default anyway. Prevents re-setting the AVRO_NULL_CODEC a second time.

@martin-augment
Copy link
Copy Markdown
Owner Author

martin-augment commented Jan 2, 2026

  1. Duplicate codec metadata setting (lang/c++/impl/DataFile.cc:175-177)
    validateCodec(codec_, compressionLevel_);
    setMetadata(AVRO_CODEC_KEY, AVRO_NULL_CODEC);  // Line 176 - Always sets null
    if (codec_ == NULL_CODEC) {
        setMetadata(AVRO_CODEC_KEY, AVRO_NULL_CODEC);  // Line 178 - Duplicate!
    The metadata is unconditionally set to AVRO_NULL_CODEC at line 176, then immediately overwritten. Remove line 176.

value:good-to-have; category:bug; feedback:The Claude AI reviewer is correct! There is no need to check whether the codec is AVRO_NULL_CODEC because it is set as a default anyway. Prevents re-setting the AVRO_NULL_CODEC a second time.

@martin-augment
Copy link
Copy Markdown
Owner Author

2. Missing header include (lang/c++/impl/DataFile.cc)
The code uses std::optional but doesn't explicitly include <optional>. While it may work through transitive includes, it should be explicitly included for portability:
c++ #include <optional>

value:useful; category:bug; feedback:The Claude AI reviewer is correct! This include is missed in two .cpp files. The build passes because the import comes as a transitive from another import. Prevents a build break if the transitive dependency stops using std::optional and removes its include for it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants